docs(simd): W1a consumer contract — 5 primitive specs + VPABSB correction by AdaWorldAPI · Pull Request #149 · AdaWorldAPI/ndarray

AdaWorldAPI · 2026-05-16T20:26:18Z

Captures the consumer-side architectural contract that the AdaWorldAPI/lance-graph spine is staging against this ndarray fork. A PRE-MERGE audit of lance-graph main on 2026-05-16 surfaced 158 raw-intrinsic violations across 5 consumer crates plus 3 missing primitives here that block clean remediation. This doc is the spec for what the W1a primitive queue must do; consumer-side migrations cannot proceed until these primitives ship.

Files

.claude/knowledge/vertical-simd-consumer-contract.md (NEW, 328 LOC) — the canonical W1a spec.
CLAUDE.md § Hard Rules (+1 line) — pointer + VPABSB callout + gating-relationship statement.

The pattern

ndarray's SIMD surface is designed AS-IF for our exact workloads: struct methods on typed wrappers (I8x16, U8x32, F32x16, U64x8, …) plus closure-parameterized batch primitives that absorb the consumer's domain semantics. Consumers see zero raw intrinsics, zero cfg(target_arch), zero runtime feature-detect — they call I8x16::from_i4_packed_u64(...), I8x16::saturating_abs(...), batch_packed_i4_16(..., |lanes, aux| { ... }). The polyfill (this repo) owns dispatch, chunking, tail handling, and scalar fallback.

P0: VPABSB correction

The original lance-graph PR #400 capture claimed _mm512_abs_epi8 saturates i8::MIN → 127 by ISA. This is wrong — VPABSB returns the same bit pattern for 0x80 (i.e., abs(i8::MIN) = i8::MIN, since +128 doesn't fit in i8). Codex caught this on PR #400; the binding correction is:

// AVX-512 saturating_abs
let raw_abs = _mm512_abs_epi8(self.0);
let clamped = _mm512_min_epu8(raw_abs, _mm512_set1_epi8(0x7f));
I8x16(clamped)

The VPMINUB (unsigned-byte min) clamp remaps 0x80 (= 128 unsigned > 127) down to 0x7f. All other lanes are unchanged since abs(x) < 0x80 for x ≠ i8::MIN. NEON vqabsq_s8 is already hardware-saturating (the q suffix); scalar i8::saturating_abs is correct.

Mandatory parity test:

let input = I8x16::splat(i8::MIN);
assert_eq!(input.saturating_abs().lane_i8::<0>(), i8::MAX);

The widen-then-negate trick used in lance-graph PR #398's mul.rs is NOT a substitute — the new primitive must produce saturating semantics in the byte-wide register without widening, since downstream consumers will rely on byte-wide semantics for tight i4/i8 packed loops.

W1a queue — 5 primitives this PR specs

Each will be a separate PR (parallel review, tight scope):

TD	API	Consumer driving the spec
`TD-NDARRAY-SIMD-UNPACK-I4-16D`	`I8x16::from_i4_packed_u64` + `batch_packed_i4_16<E, F>` closure-batch	`lance-graph::mul::i4_eval::batch` (5 fns)
`TD-NDARRAY-SIMD-SATURATING-ABS-I8`	`I8x16::saturating_abs` (VPABSB + VPMINUB clamp)	lance-graph PR #398 Direction-B fix
`TD-NDARRAY-SIMD-GATHER`	`U16x8::gather_u16` + `palette_lookup_u8x8`	`bgz17/src/simd.rs:88`
`TD-NDARRAY-SIMD-PREFETCH`	`prefetch_read_t0/t1/t2` cross-arch	`bgz17/src/prefetch.rs:96,100`
`TD-NDARRAY-SIMD-POPCOUNT-U64`	`U64x8::popcnt` + `xor_popcount`	`holograph/hamming.rs`, `blasgraph/types.rs`

W1.5 — deferred primitives (gated on lance-graph::sigker certification)

Three more queued behind jc Pillar 11 activation: signature-PDE-sweep, randomized-projection, lyndon-pack. The W1a additions must be designed broad enough to compose with these — in particular, the closure-batch shape introduced in W1a-#1 is the foundation for W1.5-#7 (randomized signatures).

Acceptance criteria for each W1a PR

All three backends (AVX*, NEON, scalar) — scalar is the correctness anchor
Doc-comment states saturating/overflow/signedness semantics explicitly
Mandatory parity test on randomized + edge-case corpus
No new is_*_feature_detected! outside src/hpc/simd_caps.rs
// SAFETY: comments on all unsafe blocks
Consumer site cited in PR description

The simd-savant agent on the lance-graph side runs PRE-MERGE against every W1a PR to verify compliance.

Cross-references

AdaWorldAPI/lance-graph PR #399 — introduced the simd-savant agent + autoattended-multiagent pattern doc
AdaWorldAPI/lance-graph PR #400 — architectural capture (alien-magic + EPIPHANIES + TECH_DEBT)
AdaWorldAPI/lance-graph PR #398 — codex P1 (NEON OOB) + P2 (i8::MIN divergence) — the trigger
AdaWorldAPI/lance-graph PR (open) — corrects the VPABSB claim in lance-graph's knowledge doc
Intel Intrinsics Guide: _mm512_abs_epi8, _mm512_min_epu8, _mm512_popcnt_epi64, _mm256_i32gather_epi32
ARM Architecture Reference: VQABS (vqabsq_s8), VCNT (vcntq_u8)

https://claude.ai/code/session_01UwJuKqP828qyX1VkLgGJFS

Generated by Claude Code

…tion Captures the consumer-side architectural contract that the AdaWorldAPI/ lance-graph spine is staging against this ndarray fork. A PRE-MERGE audit of lance-graph main on 2026-05-16 surfaced 158 raw-intrinsic violations across 5 consumer crates plus 3 missing primitives here that block clean remediation. This doc is the spec for what the W1a primitive queue must do; consumer-side migrations cannot proceed until these primitives ship. Files (2): 1. .claude/knowledge/vertical-simd-consumer-contract.md (NEW, 328L) The pattern: struct methods on typed wrappers + closure- parameterized batch primitives. Consumers see zero raw intrinsics and zero arch-specific cfg; the polyfill (this repo) owns runtime feature dispatch, lane chunking, tail handling, scalar fallback. **VPABSB correction (P0):** _mm512_abs_epi8 does NOT saturate i8::MIN — it returns the same bit pattern (0x80 → 0x80, which is still -128 when interpreted as i8). The lance-graph PR #400 codex P1 review caught the original claim that VPABSB saturates by ISA; the correct AVX-512 saturating_abs is _mm512_min_epu8(_mm512_abs_epi8(x), _mm512_set1_epi8(0x7f)) The VPMINUB clamp remaps 0x80 (unsigned 128 > 127) down to 0x7f. NEON vqabsq_s8 is already hardware-saturating (q-suffix); scalar i8::saturating_abs is correct. Five W1a primitive specs with per-arch implementation hints, API surfaces, mandatory parity-test requirements, and consumer call sites: - TD-NDARRAY-SIMD-UNPACK-I4-16D: I8x16::from_i4_packed_u64 + batch_packed_i4_16<E, F> closure-batch (consumer: lance-graph mul::i4_eval::batch, 5 fns) - TD-NDARRAY-SIMD-SATURATING-ABS-I8: I8x16::saturating_abs (the VPABSB+VPMINUB fix above) - TD-NDARRAY-SIMD-GATHER: U16x8::gather_u16 + palette_lookup_u8x8 (consumer: bgz17/src/simd.rs:88) - TD-NDARRAY-SIMD-PREFETCH: prefetch_read_t0/t1/t2 cross-arch (consumer: bgz17/src/prefetch.rs:96,100) - TD-NDARRAY-SIMD-POPCOUNT-U64: U64x8::popcnt + U64x8::xor_popcount (consumer: holograph/hamming.rs, blasgraph/types.rs) Three W1.5 deferred primitives (gated on lance-graph:crates/sigker benchmarking + jc Pillar 11 activation): signature-PDE-sweep, randomized-projection, lyndon-pack. Mentioned so W1a additions are designed broad enough to compose with these later. Acceptance criteria: all three backends mandatory, parity test mandatory, saturating/overflow semantics documented, no new is_*_feature_detected! outside simd_caps, // SAFETY: on all unsafe blocks, consumer site cited in PR description. Cross-links to lance-graph knowledge doc, simd-savant card, E-SIMD-SWEEP-1 epiphany, TD-NDARRAY-SIMD-* / TD-SIMD-SWEEP-W* debt ledger, PR #398/#399/#400 history. Plus Intel + ARM intrinsic references, Hambly-Lyons 2010, Cuchiero-Schmocker- Teichmann 2021, Jirak 2016. 2. CLAUDE.md § Hard Rules (+1 row) "All new public pub fn in src/simd_*.rs follows the W1a consumer contract" — pointer to the doc, callout that VPABSB does NOT saturate i8::MIN, and statement of the gating relationship (missing primitives here force consumer-side raw-intrinsic violations, so additions here are blocking the consumer sweep). Once this lands, W1a workers can spawn against this branch in parallel: 5 small PRs against ndarray master, each implementing one primitive from the queue with the parity tests mandated by this doc. https://claude.ai/code/session_01UwJuKqP828qyX1VkLgGJFS

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7c2161b4a0

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-16T20:28:24Z

+```
+
+**Per-arch implementation:**
+- **AVX2/AVX-512:** `_mm256_i32gather_epi32` with index widening + downcast (caveat: `_mm256_i32gather_epi32` reads 32 bits per index; for u16 values pack two indices per gather slot, or downcast post-gather).


Avoid reading past u16 tables with 32-bit gathers

For gather_u16, using _mm256_i32gather_epi32 against a &[u16] can read four bytes starting at table[index]; when an allowed index is table.len() - 1, that crosses the end of the slice even though the later bounds contract only requires max(indices) < table.len(). If a future implementation follows this hint, release builds can perform an out-of-bounds SIMD load for valid last-element lookups instead of safely returning the final u16.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-05-16T20:28:24Z

+**Per-arch implementation:**
+- **AVX-512 VPOPCNTDQ:** `_mm512_popcnt_epi64` directly. Feature flag `avx512vpopcntdq`.
+- **AVX-512 without VPOPCNTDQ:** fallback via `_mm512_sad_epu8` on a per-byte popcount LUT (Mula's algorithm using VPSHUFB).
+- **NEON:** `vcntq_u8` for byte popcount, then horizontal sum within each u64 via `vaddvq_u8` or `vpaddlq_u8` cascade.


Preserve u64 lanes in the NEON popcount recipe

For lane-wise U64x8::popcnt, vaddvq_u8 reduces all bytes in a NEON vector to one scalar, so using it as described here would merge the counts of multiple u64 lanes instead of returning one count per lane. This only shows up on the NEON backend, where a future implementation following this contract would disagree with the scalar and AVX results for inputs such as [1, 0, ...] versus [0, 1, ...]; the hint should require a per-u64 widening/pairwise reduction instead.

Useful? React with 👍 / 👎.

Resolves add/add conflict on .claude/knowledge/vertical-simd-consumer-contract.md by taking master's version (PR #149 — the polished READ BY / P0 TRIGGERS form with agent routing). CLAUDE.md gains the W1a contract hard rule pointer from master. https://claude.ai/code/session_01EHNZhSmJ52FGyDxtCFgzXo

chatgpt-codex-connector Bot reviewed May 16, 2026

View reviewed changes

AdaWorldAPI merged commit d0627b8 into master May 16, 2026
14 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

docs(simd): W1a consumer contract — 5 primitive specs + VPABSB correction#149

docs(simd): W1a consumer contract — 5 primitive specs + VPABSB correction#149
AdaWorldAPI merged 1 commit into
masterfrom
claude/vertical-simd-consumer-contract-w1a-spec

AdaWorldAPI commented May 16, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot May 16, 2026

Uh oh!

chatgpt-codex-connector Bot May 16, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

AdaWorldAPI commented May 16, 2026

Files

The pattern

P0: VPABSB correction

W1a queue — 5 primitives this PR specs

W1.5 — deferred primitives (gated on lance-graph::sigker certification)

Acceptance criteria for each W1a PR

Cross-references

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 16, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot May 16, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants